Regularized Newton Method with Global Convergence
We present a Newton-type method that converges fast from any initialization
and for arbitrary convex objectives with Lipschitz Hessians. We achieve this by
merging the ideas of cubic regularization with a certain adaptive
Levenberg--Marquardt penalty. In particular, we show that the iterates given by
$x^{k+1} = x^k - \bigl(\nabla^2 f(x^k) + \sqrt{H\,\|\nabla f(x^k)\|}\,\mathbf{I}\bigr)^{-1}\nabla f(x^k)$, where $H>0$ is a constant, converge
globally with a $\mathcal{O}(1/k^2)$ rate. Our method is the first
variant of Newton's method that has both cheap iterations and provably fast
global convergence. Moreover, we prove that locally our method converges
superlinearly when the objective is strongly convex. To boost the method's
performance, we present a line search procedure that does not need
hyperparameters and is provably efficient.
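For illustration, here is a minimal NumPy sketch of one such regularized Newton step applied to a toy convex quadratic; the constant H and the test problem are arbitrary choices for the example, not values taken from the paper.

```python
import numpy as np

def regularized_newton_step(x, grad, hess, H=1.0):
    """One step of x+ = x - (hess(x) + sqrt(H * ||grad(x)||) * I)^{-1} grad(x)."""
    g = grad(x)
    penalty = np.sqrt(H * np.linalg.norm(g))   # adaptive Levenberg--Marquardt term
    return x - np.linalg.solve(hess(x) + penalty * np.eye(x.size), g)

# Toy example: a strictly convex quadratic f(x) = 0.5 x^T A x - b^T x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
hess = lambda x: A

x = np.array([10.0, -10.0])
for _ in range(30):
    x = regularized_newton_step(x, grad, hess)
print(x, np.linalg.solve(A, b))  # the iterate approaches the exact minimizer
```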
Adaptive Proximal Gradient Method for Convex Optimization
In this paper, we explore two fundamental first-order algorithms in convex
optimization, namely, gradient descent (GD) and proximal gradient method
(ProxGD). Our focus is on making these algorithms entirely adaptive by
leveraging local curvature information of smooth functions. We propose adaptive
versions of GD and ProxGD that are based on observed gradient differences and,
thus, have no added computational costs. Moreover, we prove convergence of our
methods assuming only local Lipschitzness of the gradient. In addition, the
proposed versions allow for even larger stepsizes than those initially
suggested in [MM20].
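As a point of reference, the sketch below implements an [MM20]-style adaptive stepsize in NumPy, estimating local curvature from consecutive gradients; the constants follow the original [MM20] rule, not the larger stepsizes analyzed in this paper.

```python
import numpy as np

def adgd(grad, x0, n_iters=200, lam0=1e-6):
    """Adaptive gradient descent in the spirit of [MM20]: the stepsize is
    built from observed gradient differences, so no global Lipschitz constant
    or line search is needed."""
    x_prev, g_prev = x0, grad(x0)
    lam_prev, theta = lam0, np.inf
    x = x_prev - lam_prev * g_prev
    for _ in range(n_iters):
        g = grad(x)
        diff_x = np.linalg.norm(x - x_prev)
        diff_g = np.linalg.norm(g - g_prev)
        # local estimate of 1/L from the last two iterates and gradients
        local = diff_x / (2.0 * diff_g) if diff_g > 0 else np.inf
        lam = min(np.sqrt(1.0 + theta) * lam_prev, local)
        theta = lam / lam_prev
        x_prev, g_prev, lam_prev = x, g, lam
        x = x - lam * g
    return x

# Usage on a simple smooth convex problem:
A = np.diag([1.0, 10.0, 100.0])
x_star = adgd(lambda x: A @ x, np.ones(3))
```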
Learning-Rate-Free Learning by D-Adaptation
The speed of gradient descent for convex Lipschitz functions is highly
dependent on the choice of learning rate. Setting the learning rate to achieve
the optimal convergence rate requires knowing the distance D from the initial
point to the solution set. In this work, we describe a single-loop method, with
no back-tracking or line searches, which does not require knowledge of D yet
asymptotically achieves the optimal rate of convergence for the complexity
class of convex Lipschitz functions. Our approach is the first parameter-free
method for this class without additional multiplicative log factors in the
convergence rate. We present extensive experiments for SGD and Adam variants of
our method, where the method automatically matches hand-tuned learning rates
across more than a dozen diverse machine learning problems, including
large-scale vision and language problems. Our method is practical, efficient
and requires no additional function value or gradient evaluations each step. An
open-source implementation is available
(https://github.com/facebookresearch/dadaptation)
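A minimal usage sketch against that repository, assuming it exposes a DAdaptAdam class as a drop-in torch.optim-style optimizer (the class name and the lr=1.0 multiplier convention should be checked against the installed version):

```python
import torch
from dadaptation import DAdaptAdam  # pip install dadaptation (assumed package name from the repo above)

model = torch.nn.Linear(10, 1)
# No learning-rate tuning: D-Adaptation estimates the scale itself;
# lr acts only as a multiplier on that estimate.
optimizer = DAdaptAdam(model.parameters(), lr=1.0)

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```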
Prodigy: An Expeditiously Adaptive Parameter-Free Learner
We consider the problem of estimating the learning rate in adaptive methods,
such as Adagrad and Adam. We describe two techniques, Prodigy and Resetting, to
provably estimate the distance to the solution D, which is needed to set the
learning rate optimally. Our techniques are modifications of the D-Adaptation
method for learning-rate-free learning. Our methods improve upon the
convergence rate of D-Adaptation by a factor of $\mathcal{O}(\sqrt{\log(D/d_0)})$, where
$d_0$ is the initial estimate of $D$. We test our methods on 12 common
logistic-regression benchmark datasets, VGG11 and ResNet-50 training on
CIFAR10, ViT training on ImageNet, LSTM training on IWSLT14, DLRM training on
the Criteo dataset, VarNet on the Knee MRI dataset, as well as RoBERTa and GPT
transformer training on BookWiki. Our experimental results show that our
approaches consistently outperform D-Adaptation and reach test accuracy values
close to those of hand-tuned Adam.
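To convey the underlying idea rather than the exact Prodigy updates, the sketch below maintains a running lower bound on the distance D, which is valid for convex objectives, and uses it to scale an AdaGrad-like step; the stepsize schedule and weighting are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def distance_estimating_sgd(grad, x0, n_iters=500, d0=1e-6):
    """Toy subgradient method that estimates D = ||x0 - x*|| on the fly.
    For convex f, d_hat = sum_k lam_k <g_k, x0 - x_k> / ||sum_k lam_k g_k||
    lower-bounds D; the step lam_k below is a simplified stand-in for the
    actual D-Adaptation / Prodigy schedules."""
    x, d = x0.copy(), d0
    s = np.zeros_like(x0)   # weighted gradient sum
    numer = 0.0             # running numerator of the lower bound
    sq_norms = 0.0          # running sum of squared gradient norms
    for _ in range(n_iters):
        g = grad(x)
        sq_norms += np.dot(g, g)
        lam = d / np.sqrt(sq_norms)      # AdaGrad-like step scaled by the estimate of D
        s += lam * g
        numer += lam * np.dot(g, x0 - x)
        d = max(d, numer / (np.linalg.norm(s) + 1e-12))  # never shrink the estimate
        x = x - lam * g
    return x, d
```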
- …